ABSTRACT



A common procedure for obtaining multiple readings (ratings) 
for a constructed response item, especially in high-stakes tests, is to have 
two readers read the papers independently, with a third reading if the 
results differ by more than one point. This necessitates a scoring rule that 
specifies how the ratings will be aggregated into a single item score. Two 
plausible scoring rules involve averaging the readings and rounding either to 
the nearest half point or the nearest integer, but it is not known which 
results in a greater precision of measurement. This study investigated the 
precision and accuracy of ability estimates obtained under the two scoring 
rules for mixed format tests calibrated under an item response theory model. 
Eleventh-grade reading, mathematics, and science test results and a 
fifth-grade mathematics test result were analyzed, with more than 1,200 
students available for each form. There was little substantive difference in 
score information or the standard errors of ability estimates due to the type 
of rounding (integer versus half point) above the floors of three of the
four tests, but in the fourth (11th grade reading) there was less error in
the integer-rounded ability estimates at the lower portion of the scale.
Integer-rounded estimates generally produce slightly larger predicted percent
of maximum (test) scores, though not throughout the entire ability range of
all four tests studied. The expected larger positive differences or
rounding bias for number correct estimates were observed. Within-subject
differences between scale score estimates derived using integer versus
half-point scores were generally small for both pattern and number correct
ability estimates. The lack of substantive improvement in measurement
precision that could be attributed to half-point rounding, coupled with the
documented instance of increased error induced by that type of rounding in a
portion of the ability range of students taking one test, would seem to argue
for rounding average ratings to the nearest integer. Rounding up gives the
preponderance of students the benefit of the doubt concerning the
acceptability of their responses. (Contains two tables, four figures, and
eight references.) (SLD)

INTRODUCTION



A common procedure for obtaining multiple readings (ratings) 
for a constructed response (c.r.) item, particularly those c.r. 
items in tests used to make high-stakes decisions, is to have two 
readers independently read the papers with a third independent 
reading acquired if the ratings differ by more than one point. 

The presence of two or three readings of a response to a c.r. item 
necessitates a scoring rule that specifies how the ratings 
(readings) will be aggregated into a single item score. 

Two plausible scoring rules involve averaging the two (or 
three) item ratings and rounding either to the nearest half point 
or to the nearest integer. Both rules are compatible with tests 
containing multiple item types (mixed-format tests incorporating
multiple-choice (m.c.) and c.r. items) that are scaled using a
generalized IRT model incorporating a three-parameter logistic 
model (3pl) for the m.c. items and a two-parameter partial credit 
model (2ppc) for the c.r. items. This 3pl/2ppc type of 
generalized IRT model has been shown to better fit items in mixed- 
format tests (Fitzpatrick, Link, Yen, Burket, Ito & Sykes, 1996). 

It is not known whether increasing the number of levels by 
rounding to the nearest half point results in greater precision of 
measurement than rounding to the nearest integer. Half point item 
scores would reflect rater disagreement. A potential drawback to 
rounding to the nearest integer is that for the majority of a 
student's responses to the c.r. items only two readings will be 
necessary. A one point disagreement would always result in an
average item score that is rounded up, introducing varying degrees
of positive "rounding" bias into the total raw scores.

The precision or reliability of "half-point round" versus 
"integer-round" c.r. item scores on ability estimates can be 
assessed by the evaluation of information functions for the 
composite test scores to which they contribute. The reciprocal of
the square root of the information function for a composite score
at a particular ability level is the standard error of the ability
estimate (s.e.).

If the use of half score points increases measurement precision 
(reduces error) , test scores utilizing them should demonstrate 
lower s.e.s across at least portions of the ability range. 
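
As a minimal sketch of the relationship between information and error, the snippet below converts an information value at a given ability level into a standard error. The function name and the `scale_multiplier` argument are illustrative assumptions, not part of the study; a multiplier of 50 would correspond to the scale score metric described in the Method section.

```python
import math


def standard_error(information, scale_multiplier=1.0):
    """Standard error implied by a test information value.

    The s.e. on the theta metric is 1 / sqrt(information). The optional
    scale_multiplier (hypothetical argument) re-expresses it on a linearly
    transformed scale, e.g. 50 for a scale score of 50 * theta + 500.
    """
    if information <= 0:
        raise ValueError("information must be positive")
    return scale_multiplier / math.sqrt(information)
```

Under this sketch, an information value of 4.0 gives an s.e. of 0.5 on the ability metric, or 25 scale score points with a multiplier of 50.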

Differences in information may result from differences in how 
ability estimates weight component item scores. The weighting by 
item discrimination associated with pattern scoring, which 
utilizes the examinee's pattern of responses to the items, allows 
an item's contribution to an ability estimate to vary relative to 
the degree to which item scores are associated with ability. 
Conversely number-correct scoring, by considering one item point 
or level to be as good as any other, requires that each point 
contribute equally to the total score and derived ability 
estimate. 

The degree of rounding bias that is incurred by rounding the
average readings for a c.r. item to the nearest integer may be
evaluated by comparing the test characteristic curves (tccs) for
ability estimates derived using half-point-rounded scores with
tccs obtained using integer-rounded scores. When the tccs are
obtained by scoring a single sample both ways, differences in the
expected number correct or predicted percentage of maximum score
(predicted pm) for integer-rounded c.r. scores relative to the
prediction of the tcc for half-point-rounded scores reflect the
bias or inaccuracy due to type of rounding.

The purpose of this research was to investigate the precision
and accuracy of ability estimates obtained under the two scoring
rules for mixed-format tests calibrated with a generalized
3pl/2ppc IRT model. Test information and tccs obtained through the
application of the two scoring rules were compared for each of two
types of ability estimates: pattern and number-correct scores.
Additionally, within-subject differences in examinees' scaled
scores were evaluated for signs that subsets of examinees may be
substantively advantaged or disadvantaged by the manner of
rounding employed.

METHOD 

Source Data 

Mixed-format pilot (operational forms undergoing a final pre- 
operational administration) eleventh grade Reading, Math, and 
Science forms and a single tryout fifth grade Math form were 
available from two testing programs. The number of scored items 
of each type and the range in the number of levels (including 0) 
of the c.r. items are summarized below: 




                                                Range in Number of Levels
                                                      of C.R. Items
Content             Multiple    Constructed    Half-Point-    Integer-
Area       Grade    Choice      Response       Rounded        Rounded

Reading     11        35            1          (7 - 7)        (4 - 4)
Math        11        40            6          (5 - 11)       (3 - 6)
Science     11        42            8          (5 - 7)        (3 - 4)
Math         5        49*          11          (5 - 9)        (3 - 5)



* Eighteen of the 49 items denoted as multiple choice for the
Math/Grade 5 test were actually gridded response items. Although
similar to the 11 three- to five-level c.r. items in being
scaled with a partial credit model, they are not considered c.r.
items for the purpose of this study because their scores do not
involve ratings or their averaging.

More than 1200 students were available for each form.

Rating Process 

Each c.r. item in the four tests was scored by at least two 
readers. If the readers' scores differed by more than one point, 
a third rating was obtained. Half-point-rounded scores were 
obtained by averaging the two or three ratings for an item and 
rounding to the nearest half point. Integer-rounded c.r. item
scores resulted from rounding the average rating to the nearest
integer.

The implemented rating process resulted in the production of
four meaningful kinds of averages. An average score equal to an
integer could occur with either two or three readings. A second
kind of average, a score with a remainder of 1/2, occurred when
two readers disagreed by a single point. The final two kinds
of averages occurred when the average of three readings had
a remainder of 1/3 or 2/3.



Averages with a remainder of 1/2 or 2/3 would be rounded up to
the next integer with integer-rounding (e.g., 2.5 or 2.67 to 3.0).
An average with a remainder of 1/3 would be reduced to the lower 
integer with this type of rounding. Average scores with any of 
the three possible remainders would be rounded to the half point 
with half-point rounding. 
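
The two scoring rules described above can be sketched as follows. This is an illustrative sketch rather than the scoring software used in the study: the function names are hypothetical, and exact fractions are used so that remainders of 1/3, 1/2, and 2/3 are handled without floating-point error. Ties at a remainder of 1/2 are rounded up under integer rounding, as the text states.

```python
from fractions import Fraction
import math


def average_rating(ratings):
    """Exact average of two or three integer ratings."""
    return Fraction(sum(ratings), len(ratings))


def round_to_integer(average):
    """Round to the nearest integer, with a remainder of 1/2 rounded up."""
    return math.floor(average + Fraction(1, 2))


def round_to_half_point(average):
    """Round to the nearest half point (multiples of 1/2)."""
    return Fraction(math.floor(2 * average + Fraction(1, 2)), 2)
```

For example, ratings of 2 and 3 average 2.5, which stays 2.5 under half-point rounding and becomes 3 under integer rounding; ratings of 0, 2, and 2 average 1 1/3, which becomes 1.5 under half-point rounding and 1 under integer rounding.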

Readers for both testing programs were trained to implement 
scoring rubrics; anchor papers, check sets, and read behinds were 
employed to verify and maintain scoring accuracy. Inter-rater 
reliability studies that incorporated second reads for a large 
sample of students taking each test indicated that the percentage 
of exact agreement on the 15 c.r. items in the three eleventh- 
grade tests ranged from 68% to 93%. Minimum and maximum exact 
agreement rates of 51% and 97% were obtained in a similar manner 
for the 11 c.r. items in the fifth grade Math test. Approximate 
agreement (within one point) ranged between 89% and 100% across 
the c.r. items in all four tests. 
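
The agreement percentages reported above can be illustrated with a short sketch. The function name and the example ratings are hypothetical; the study's own reliability computations are not described in more detail than the text gives.

```python
def agreement_rates(first_reads, second_reads):
    """Return (exact, approximate) agreement proportions for paired ratings.

    Exact agreement counts identical ratings; approximate agreement counts
    ratings that differ by no more than one point.
    """
    if len(first_reads) != len(second_reads) or not first_reads:
        raise ValueError("need two non-empty rating lists of equal length")
    pairs = list(zip(first_reads, second_reads))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    approximate = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, approximate
```

Papers that fall outside approximate agreement are the ones that would receive a third reading under the rating process described earlier.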

Scaling Process 

Multiple-choice and open-ended items were scaled together 
twice using the generalized IRT model. With the generalized 
model a three-parameter logistic model (Lord, 1980) was used for 
the multiple-choice items: 




$$P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp[-1.7 A_i (\theta - B_i)]} \tag{1}$$

where $A_i$ is the discrimination, $B_i$ is the difficulty, and $c_i$ is
the lower asymptote or guessing parameter for item $i$.
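
A minimal sketch of the three-parameter logistic model follows. The function name is hypothetical, and the scaling constant of 1.7 is an assumption based on the conventional form of the model in Lord (1980).

```python
import math

D = 1.7  # logistic scaling constant (assumed, conventional value)


def p_3pl(theta, a, b, c):
    """Probability of a correct response to a multiple-choice item.

    a: discrimination, b: difficulty, c: lower asymptote (guessing).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
```

When ability equals the item difficulty, the probability sits halfway between the lower asymptote and 1; at very low ability it approaches the lower asymptote.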



A generalization of Masters' (1982) Partial Credit model was
used for the c.r. items. This 2PPC model is the same as Muraki's
(1992) "generalized partial credit model." For a c.r. item with
$m_i$ score levels assigned integer scores that ranged from
0 to $m_i - 1$:

$$P_{ik}(\theta) = P(x_i = k - 1 \mid \theta) = \frac{\exp(z_{ik})}{\sum_{j=1}^{m_i} \exp(z_{ij})}, \qquad k = 1, \ldots, m_i \tag{2}$$

where

$$z_{ik} = \alpha_i (k - 1)\theta - \sum_{j=0}^{k-1} \gamma_{ij}$$

and $\gamma_{i0} = 0$. $\alpha_i$ is the item discrimination; $\gamma_{ij}$ is related to the
difficulty of the item levels: the trace lines for adjacent score
levels intersect at $\gamma_{ij} / \alpha_i$.

Parameter Estimation and Model Predictions of Performance

Item parameter and $\theta$ estimation was conducted using the
program PARDUX (Burket, 1991; 1995). Item parameters were
estimated using marginal maximum likelihood procedures
implemented with an EM algorithm. Evaluations of the accuracy of
the program with simulated data (Fitzpatrick, 1994) have found it
to be at least as accurate as MULTILOG (Thissen, 1986). The
ability scale was defined by specifying a prior true $\theta$
distribution to have a mean of 0.0 and standard deviation of 1.0.



Maximum likelihood ability estimates were obtained. For 
reporting purposes, the ability estimates obtained for each test 
were linearly transformed to a scale score metric by multiplying 
by 50 and adding 500. The resulting pattern scale scores, though
not the number correct scale scores, were expressed to the half point.
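
The linear transformation to the scale score metric can be sketched as below. The function names are hypothetical, and the rule used to express a value to the half point (nearest half point, ties upward) is an assumption, since the text does not state how the pattern scale scores were rounded.

```python
import math


def to_scale_score(theta):
    """Linear transformation of an ability estimate: 50 * theta + 500."""
    return 50.0 * theta + 500.0


def to_half_point(value):
    """Express a value to the nearest half point (assumed rounding rule)."""
    return math.floor(2.0 * value + 0.5) / 2.0
```

Under this transformation, ability estimates of -4.0 and +4.0 map to scale scores of 300 and 700.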

Fit 



Model fit was evaluated with a generalization of the Yen
(1981) $Q_1$ statistic comparing observed and predicted trace lines
(Fitzpatrick et al., 1996). The Fit z is a standardization of
the $Q_1$ statistic that facilitates comparisons of items with
varying numbers of score levels:



$$z = \frac{Q_1 - df}{\sqrt{2\,df}} \tag{3}$$

where $df$ is the degrees of freedom of $Q_1$.



The power of z increases with sample size, so for flagging 
purposes the statistic is typically compared to critical values 
that increase with the size of samples. For samples of the size 
used in this study a value of 4.0 was used to flag items for
misfit.

Observed and predicted trace lines were also compared
graphically. Because of the difficulty of interpreting multiple
trace line plots for multi-level items, observed item
performance was compared against predicted performance using the
item characteristic function:

$$E(x_i \mid \theta) = \sum_{k=1}^{m_i} (k - 1) P_{ik}(\theta) \tag{4}$$

Predictions of test performance were made through test
characteristic functions obtained by summing item characteristic
functions:

$$E(X / n_{\max} \mid \theta) = \frac{1}{n_{\max}} \sum_{i=1}^{\text{nitem}} \sum_{k=1}^{m_i} (k - 1) P_{ik}(\theta)$$

where $X$ is the test score and $n_{\max}$ is the maximum number of
points in the test. After multiplication by 100 a predicted
percentage of maximum test score was obtained.
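
The item and test characteristic functions can be sketched in code as follows. This is an illustration under stated assumptions, not the PARDUX implementation: the function names, the scaling constant of 1.7, and the item parameter values in the usage note are all hypothetical. Each c.r. item is described by its discrimination and a list of its level parameters (the first level parameter, fixed at zero, is implicit).

```python
import math

D = 1.7  # logistic scaling constant (assumed)


def p_3pl(theta, a, b, c):
    """Probability of a correct m.c. response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def p_2ppc(theta, alpha, gammas):
    """Score-level probabilities for a c.r. item under the 2PPC model.

    gammas holds the level parameters for scores 1 .. m-1; the parameter
    for score 0 is fixed at zero, so m = len(gammas) + 1 levels result.
    """
    z = [0.0]
    cumulative = 0.0
    for score, gamma in enumerate(gammas, start=1):
        cumulative += gamma
        z.append(alpha * score * theta - cumulative)
    z_max = max(z)  # subtract the maximum for numerical stability
    weights = [math.exp(value - z_max) for value in z]
    total = sum(weights)
    return [w / total for w in weights]


def expected_item_score(probabilities):
    """Item characteristic function: sum of score times probability."""
    return sum(score * p for score, p in enumerate(probabilities))


def predicted_pm(theta, mc_items, cr_items):
    """Predicted percentage of maximum test score at a given ability."""
    expected = sum(p_3pl(theta, a, b, c) for a, b, c in mc_items)
    expected += sum(
        expected_item_score(p_2ppc(theta, alpha, gammas))
        for alpha, gammas in cr_items
    )
    max_points = len(mc_items) + sum(len(gammas) for _, gammas in cr_items)
    return 100.0 * expected / max_points
```

For instance, `predicted_pm(0.0, [(1.0, 0.0, 0.2)], [(1.0, [0.0, 0.0])])` evaluates a toy test with one m.c. item and one three-level c.r. item, for a maximum of three points.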

Information 

Results were evaluated with respect to test score
information. The pattern scores produced using the 3PL/2PPC
model utilize optimal scoring weights, $w_i$, which maximize test
information. For the 3PL items, these weights are defined by
Lord (1980, Section 4.13), and for the 2PPC items the optimal
weights are $\alpha_i$. When the optimal weights are used, the test
score information is the sum of item information functions,
defined as:



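$$I_i(\theta) = \sum_{k=1}^{m_i} \frac{[P'_{ik}(\theta)]^2}{P_{ik}(\theta)} \tag{5}$$

where $P'_{ik}(\theta)$ is the derivative of $P_{ik}(\theta)$ with respect to $\theta$. Test
score information is subsequently:

$$I(\theta) = \sum_{i=1}^{n} \sum_{k=1}^{m_i} \frac{[P'_{ik}(\theta)]^2}{P_{ik}(\theta)} \tag{6}$$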



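It is also possible to base the ability estimate on an
unweighted sum of item scores. [In fact it is possible to base the
trait estimate on any arbitrary set of item weights.] Following
the logic of Lord (1980, Equation 5-3), the information of the
unweighted raw score is

$$I_X(\theta) = \frac{\left[\sum_{i=1}^{n} \sum_{k=1}^{m_i} (k - 1) P'_{ik}(\theta)\right]^2}{\sum_{i=1}^{n} \sigma_i^2(\theta)} \tag{7}$$

where $\sigma_i^2(\theta) = \sum_{k=1}^{m_i} (k - 1)^2 P_{ik}(\theta) - \left[\sum_{k=1}^{m_i} (k - 1) P_{ik}(\theta)\right]^2$
is the conditional variance of the score on item $i$.

For a given model, the information in the unweighted raw score is
less than or equal to the information of the optimally weighted
score.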
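
The two kinds of score information can be compared numerically with a sketch such as the one below. It is illustrative only: the function names, the scaling constant of 1.7, and the item parameters in the usage note are assumptions. Raw score information is computed as the squared slope of the expected raw score divided by the conditional variance of the raw score, following the logic of Lord (1980); each item is represented by its score-level probabilities and their derivatives at a fixed ability value.

```python
import math

D = 1.7  # logistic scaling constant (assumed)


def curves_3pl(theta, a, b, c):
    """Probabilities and derivatives for scores 0 and 1 of a 3PL item."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    dp = D * a * (p - c) * (1.0 - p) / (1.0 - c)
    return [1.0 - p, p], [-dp, dp]


def curves_2ppc(theta, alpha, gammas):
    """Probabilities and derivatives for each score level of a 2PPC item."""
    z, cumulative = [0.0], 0.0
    for score, gamma in enumerate(gammas, start=1):
        cumulative += gamma
        z.append(alpha * score * theta - cumulative)
    z_max = max(z)
    weights = [math.exp(value - z_max) for value in z]
    total = sum(weights)
    probs = [w / total for w in weights]
    mean = sum(s * p for s, p in enumerate(probs))
    derivs = [alpha * p * (s - mean) for s, p in enumerate(probs)]
    return probs, derivs


def optimal_information(curves):
    """Information of the optimally weighted (pattern) score."""
    return sum(
        d * d / p for probs, derivs in curves for p, d in zip(probs, derivs)
    )


def raw_score_information(curves):
    """Information of the unweighted raw (number-correct) score."""
    slope, variance = 0.0, 0.0
    for probs, derivs in curves:
        slope += sum(s * d for s, d in enumerate(derivs))
        mean = sum(s * p for s, p in enumerate(probs))
        variance += sum(s * s * p for s, p in enumerate(probs)) - mean * mean
    return slope * slope / variance
```

A usage example: with `curves = [curves_3pl(0.0, 1.0, 0.0, 0.2), curves_2ppc(0.0, 0.8, [-0.5, 0.5])]`, the value of `raw_score_information(curves)` does not exceed `optimal_information(curves)`, in line with the inequality stated above.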

Evaluation of Score Precision and Bias Due to Rounding 

The error associated with integer-rounded scores was 
evaluated through comparisons of model predictions, specifically 
test score information/standard errors of ability estimates and
predicted pm's, as well as comparisons of ability estimates
obtained from a single sample of students taking each of the four 
tests. Consequently student responses to the items of a test 
differed across the type of rounding condition only in the manner 
in which ratings for the c.r. items were rounded (responses to all 
other items were identical). The two types of rounded c.r. item 
scores were each utilized in the estimation of a pattern and 
number-correct test score, resulting in four combinations of type 
of c.r. item score (half-point versus integer) and ability
estimate (pattern score versus number-correct).

RESULTS



Raw Score Statistics 

Descriptive statistics for the four tests are presented in 
Table 1. The three grade 11 tests were moderately difficult, with 
means expressed as observed percent of maximum scores ranging 
between 50% and 57%. The Math/Grade 5 test was more difficult, 
with students, on average, obtaining 35% of the total 75 points 
when either integer-rounded or half-point-rounded c.r. scores were
used.

The mean of the total scores containing the integer-rounded 
c.r. item scores was, as expected, higher than the mean total 
score containing half-point c.r. item scores for each of the four 
tests. The increase ranged between .13 (21.73 - 21.60) for the 
Reading/Grade 11 test with its single c.r. item to .79 for the 
Science/Grade 11 test with its eight c.r. items. The difference 
in total scores is attributed solely to the differences in c.r. 
scores induced by the type of rounding (e.g. the difference of .13 
between the mean integer-rounded c.r. score and the mean half- 
point-rounded c.r. score {.79 versus .66, respectively}). 

Scaling Results 

All items in each of the four forms were calibrated twice 
with the 3pl/2ppc model, once using half-point-rounded c.r. item 
scores (and student responses to all other items) and the other 
time with integer-rounded scores. 

The largest number of misfitting items within the eight item
calibrations (two types of rounding times four tests) was six for
the half-point-rounded Math/Grade 5 calibration (average z = 6.53),
followed by five for the integer-rounded estimates for the same
tryout form (average z = 6.37). The largest absolute difference
between a predicted and observed p-value for these 11 misfitting
sets of item parameter estimates was .006.

Two of the remaining six item calibrations (integer-rounded 
item parameter estimates for Math/Grade 11 and half-point-rounded 
estimates for Reading/Grade 11) had two misfitting items each 
and the other four had only a single misfitting item. The 
largest absolute deviation between observed and predicted
p-values for the misfitting items in these six calibrations was
.01.

No c.r. item, when calibrated with integer-rounded or half- 
point-rounded c.r. scores, misfit. 

Information 

Figures 1 through 4 contain plots of the score information 
functions of the four combinations of ability estimate by type of 
rounded c.r. item score for Reading/Grade 11, Science/Grade 11, 
Math/Grade 11, and Math/Grade 5, respectively. Presented below
the plots of information are the standard errors of the ability
estimates, the reciprocals of the square roots of the four
score information functions, for scale score intervals of 25
points between 300 and 700, inclusive. A frequency distribution
of the number correct, or unweighted, half-point-rounded ability
estimates permits an assessment of roughly how many examinees
fall at each of the scale score values.

Pattern Ability Estimates

All four score information plots reveal that pattern
estimates provide the most information (and least error),
regardless of type of rounding. This is expected, due to the 
greater efficiency of pattern scoring. The plots of score 
information for the pattern scores are very nearly coincident for 
all four tests. An evaluation of the tabled s.e.'s indicates 
that, with the exception of the lower portion of the scale score 
range for Reading/Grade 11 (up through 425), the difference 
between the s.e.'s for integer vs half-point pattern ability 
estimates is no more than five points, and very frequently no more 
than two points. Hence, there appears to be no substantive 
difference in the precision of the two types of scores through 
most of the score ranges for the four tests. 

The lower portion of the scale score range for Reading/Grade 
11 demonstrates markedly smaller s.e.'s for the integer-rounded 
pattern scale scores relative to the half-point-rounded scores, 
however. At the floor for this test, a scale score of 300, the 
integer-rounded s.e. is 69 points less than that for the half- 
point s.e. (132 versus 201) and remains 14 points less at a scale 
score of 400. The 69 point difference at the floor is larger than 
one integer-rounded or half-point-rounded pattern score standard 
deviation (approximately 64 scale score points - Table 2). The 
greater precision of the integer-rounded pattern ability estimates 
implies that half point scores actually degrade the precision of
measurement in this subrange.

Number Correct Ability Estimates



An evaluation of the information provided by integer-rounded
versus half-point-rounded ability estimates indicates that when
the two score information functions appear to differ, i.e., greater
information for integer scores in the middle of the range for 
Math/Grade 11 and Math/Grade 5, differences in s.e.'s are not 
great. Differences in s.e.'s between 425 and 625 for Math/Grade 
11 and between 450 and 600 for Math/Grade 5 are most frequently 
only one or two scale score points. 

Nonnegligible differences in the precision of integer-rounded 
versus half-point-rounded number correct (# correct) ability 
estimates are limited to the lower part of the ability range. 
Integer-rounded number correct ability estimates, like their 
pattern counterparts, have substantially less error than half- 
point number correct ability estimates for this particular 
subrange of the Reading/Grade 11 test. At the floor the integer- 
rounded s.e. is smaller than the half-point-rounded s.e. by 124 
points (212 versus 336; almost twice the approximately 64 point 
half-point and integer-rounded number correct standard deviations 
- Table 3) and is still seven points less at 425. 

Half-point-rounded number correct estimates have marginally 
less error than integer-rounded estimates in the lower subranges 
for Science/Grade 11 and Math/Grade 5. The half-point s.e. of 112 
at 300 is 17 points less than the integer-rounded s.e. of 129 for 
the Science test. The difference is reduced to three points by 
375, however. A difference of 12 scale score points at the floor
of the Math/Grade 5 test (116 for half-point versus 128 for
integer) has similarly been reduced to three points at 375.

Predicted Percentage of Maximum Score 

Pattern Ability Estimates 

The predicted pm's are provided at the selected scale score
points for the four combinations of composite scores in Figures 1
through 4. Integer-rounded pm's tend to be slightly larger than
half-point-rounded pm's, with most of the differences less than
the 2.2 percentage point difference found at 500 to 525 for
Science/Grade 11. Predicted pm's are actually slightly smaller
(largest difference of .2) for integer-rounded ability estimates
in the lower 300 to 375 subrange for Math/Grade 11 (e.g., 18.3 for
integer-rounded c.r. scores at 325 versus 18.5 for half-point-rounded
c.r. scores). The latter exception demonstrates that the
effect of rounding to an integer does not necessitate a positive
bias on test scores throughout the scale score range.

Number Correct Ability Estimates 
Differences between the predicted pm's for integer-rounded versus
half-point-rounded c.r. scores are larger, though again differences
are not invariably in favor of the integer scores. The largest bias is
7.8 percentage points found at 500 scale score points for
Math/Grade 11. Above 600 to the ceiling of 700 for the Math/Grade
5 test, integer-rounded predicted pm's are .2 to .3 smaller than
half-point-rounded predicted pm's (e.g., 90.9 versus 91.1 at 675).

Comparisons of Ability Estimates within Examinees



Differences in scale score ability estimates produced for 
each examinee by integer versus half-point rounding were evaluated 
at each possible half point difference between the total raw c.r.
scores (integer minus half-point). The range of possible
differences in the total c.r. scores produced through integer-rounding
and half-point rounding could vary between -.5 and +.5
times the number of c.r. items (including the null or zero
difference). Not all possible differences were observed for each
test. Differences for each type of ability estimate were 
evaluated. 
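
The range of possible differences in total c.r. scores can be illustrated with the sketch below, which computes the integer-minus-half-point difference for a single item and lists the totals that are possible for a test with a given number of c.r. items. Function names are hypothetical, and a remainder of 1/2 is assumed to round up under integer rounding, as described in the Method section.

```python
from fractions import Fraction
import math


def item_difference(ratings):
    """Integer-rounded minus half-point-rounded score for one c.r. item."""
    average = Fraction(sum(ratings), len(ratings))
    integer_score = math.floor(average + Fraction(1, 2))
    half_point_score = Fraction(math.floor(2 * average + Fraction(1, 2)), 2)
    return integer_score - half_point_score


def possible_total_differences(n_items):
    """All total differences from -.5 to +.5 times the number of items."""
    return [Fraction(k, 2) for k in range(-n_items, n_items + 1)]
```

Each item contributes -.5 (remainder of 1/3), 0 (integer average), or +.5 (remainder of 1/2 or 2/3), so a test with a single c.r. item has three possible total differences and a test with six c.r. items has thirteen.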

Pattern Ability Estimates 

Table 2 contains mean pattern scale score differences for 
various differences in total c.r. scores. For the Reading/Grade 11
test, only two out of the three possible differences that could
occur with a test containing a single c.r. item actually occurred:
0 and +.5. The overall or sample mean difference at the bottom of
the mean difference column was .05 (s.d. = 1.70). The mean
difference between scale score estimates based on the two types of
rounding (again integer minus half-point) for the 336 examinees
who attained a +.5 difference in the total c.r. scores was 1.78,
with the largest difference being 25 and the smallest difference
being -.5 scale score points. These differences can be evaluated
relative to the mean half-point pattern standard error for these
336 examinees: 20.32.

The other three tests have a larger number of differences
between the total c.r. raw scores, as expected given the six to 11
c.r. items in these tests. The distributions of accumulated half
point differences vary over the three tests, with the
Science/Grade 11 test most asymmetric in having seven positive
differences (every half point from .5 to 3.5) versus only two
negative differences (-.5 and -1.0). The two Math tests are
similar in having more nearly equal numbers of positive and
negative differences, with the Grade 11 test having a wider range
of differences (all possible half point differences for the six
c.r. item test).

Distributions for all four tests exhibit a large number of
zero differences in the total c.r. scores. The preponderance of
differences are positive, which is expected given the frequent
rounding up, from a half point score to an integer, of an average
score obtained when two readers differed by a point. The
percentages of the four total samples that demonstrate negative
total c.r. differences (i.e., those students having at least one
more average c.r. item score that is larger when rounded to the
half point than when rounded to the integer) range from a low of
0% for Reading/Grade 11 to a maximum of 13% for Math/Grade 5 (11%
for -.5 plus 2% for a -1.0 total c.r. difference).

Similar to the Reading/Grade 11 test, the overall mean
differences for the other three forms were small, ranging from
-.24 for Science/Grade 11 to .82 for Math/Grade 5. Mean
scale score differences at each difference in the total c.r.
scores are not large but generally increase (decrease) with
increases (decreases) in the difference in total c.r. scores (e.g.,
from 2.20 to 6.17 as the difference in total c.r. score increases
from .5 to 2.0 for Math/Grade 5).

Minimum or maximum within-examinee differences in scale score
pattern estimates can be as large (in absolute value) as 35.5 at a
total c.r. difference of 1.5 for Science/Grade 11. This value is
more than one half of a pattern half-point (ph) or integer standard
deviation (61.06 versus 61.11) and more than twice the mean ph
standard error at that total c.r. difference (16.10).

Number Correct Ability Estimates 

Within-subject differences in number correct estimates in
Table 3 are larger than the differences in pattern estimates but,
again, not large on average. Sample mean differences range
between -.35 for Science/Grade 11 and 1.30 for Math/Grade 5. The 
largest maximum or minimum difference in number correct scale 
scores is -59 at a total c.r. score difference of -1.0 for 
Math/Grade 11. 



DISCUSSION 

The lack of a substantive improvement in measurement 
precision that could be attributed to half-point rounding, coupled 
with the documented instance of increased error induced by that 
type of rounding in a portion of the ability range of students 
taking one test (Reading/Grade 11), would seem to argue for 
rounding average c.r. ratings to the nearest integer. It cannot
be assumed that half point c.r. scores will always meaningfully
discriminate examinees on the ability being assessed.

Rounding up to the nearest integer gives a preponderance of 
students the "benefit of the doubt" concerning the acceptability 
of their response. That is, those students obtaining two readings 
that differ by a point, consequently receiving a half point 
average score (e.g. 1.5 or 2.5), are awarded the greater integer 
score. Those students that require a third reading of their
response and obtain an average with a remainder of 2/3 (e.g.,
three readers specifying a 0, 2 and 3 that averages 1.67) will
obtain the closest integer to their unrounded score.

Finally, the use of integer-rounded as opposed to half-point- 
rounded c.r. scores has the important advantage of ensuring that 
final c.r. scores can be interpreted relative to specified levels 
of the item rubrics. The meaning of a half-point c.r. score, even
if it served to discriminate examinees on the trait, would have to
be "interpolated" between the rubric levels.

A decision to round average scores to the nearest integer 
would, however, result in a relatively small percentage of the 
examinee population (under 15%, given tests similar in the 
relative proportion of c.r. items to those studied) obtaining a 
scale score that was, on average, slightly reduced relative to 
what would be obtained if rounding to the half point occurred. 

The largest mean scale score reduction for a group of
students that would score lower with integer-rounded c.r. scores
(those having negative differences in the total c.r. scores) was
-9.93 for pattern ability estimates and -13.43 for number correct
estimates, although individual decreases as large as -30.0 and
-59 scale score points were noted for these two types of
estimates, respectively.

These latter two differences, while large relative to the 
sample scale score standard deviations, are not substantially 
larger than their standard errors. Only one student (taking the 
Math/Grade 5 test) across all four test samples that had a 
negative total c.r. difference also had a difference in pattern 
scale score estimates that exceeded one (half-point-rounded) s.e.
A maximum of eight students in one test sample (Math/Grade 11) had 
a difference in number correct scale scores that was in excess of 
one (half-point) number-correct s.e. with five in Math/Grade 5 and 
two in Science/Grade 11 also having differences larger than a 
standard error. 

In terms of raw score points examinees doing worse under 
integer-rounded scoring lose 1/3 of a raw score point for every 
-.5 difference in total c.r. scores. This occurs when they have 
an average rating for a c.r. item with a 1/3 remainder that gets 
rounded down to the lower integer instead of to the closer half 
point. Consequently the single student who attained a -3.0 
difference in total c.r. scores for Math/Grade 11 gives up the 
most in raw score points by rounding to integers (1/3 times six 
half points or 2 raw score points), though this student's 
difference in pattern scale scores is a relatively small 12 scale
score points.

It should be noted that a policy of rounding to half score
points instead of integers results in reductions, albeit smaller
in magnitude, in raw score points for some examinees. These
students are those that obtained an average score with a 2/3
remainder and lose the difference of approximately .17 between
2/3 and 1/2 that accompanies half-point rounding.
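
The raw score losses discussed in the preceding paragraphs can be checked with a short worked sketch. The function names are hypothetical; a loss is measured here relative to the unrounded average, and exact fractions are used so that 1/3 and 1/6 (approximately .17) are represented exactly.

```python
from fractions import Fraction
import math


def round_to_integer(average):
    """Nearest integer, with a remainder of 1/2 rounded up."""
    return math.floor(average + Fraction(1, 2))


def round_to_half_point(average):
    """Nearest half point."""
    return Fraction(math.floor(2 * average + Fraction(1, 2)), 2)


def loss_from_rounding(average, rounded):
    """Raw score points lost relative to the unrounded average (0 if none)."""
    return max(average - rounded, Fraction(0))


# A remainder of 1/3 loses 1/3 of a point under integer rounding.
one_third_case = Fraction(4, 3)
integer_loss = loss_from_rounding(one_third_case, round_to_integer(one_third_case))

# A remainder of 2/3 loses 1/6 of a point under half-point rounding.
two_thirds_case = Fraction(5, 3)
half_point_loss = loss_from_rounding(
    two_thirds_case, round_to_half_point(two_thirds_case)
)

# Six items each rounded down from a 1/3 remainder (a -3.0 total difference).
total_loss_six_items = 6 * integer_loss
```

Here `integer_loss` is 1/3, `half_point_loss` is 1/6, and `total_loss_six_items` is 2 raw score points, matching the arithmetic given for the student with a -3.0 difference in total c.r. scores.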

The number of students incurring a reduction in raw score due
to either integer-rounding or half-point-rounding (or the
magnitude of the reductions) cannot be determined from Tables 2
and 3. A half-point difference in favor of integer rounding on
one item could have been compensated for by a half-point
difference in favor of half-point rounding on another item (excluding
those examinees who obtained the maximum or minimum possible difference
in total c.r. scores). If the probability that the rating process
produced an average score with a remainder of 1/3 was the same as
that of producing an average score with a remainder of 2/3, there
would be as many instances of losses due to half-point rounding as
integer-rounding. Hence the magnitude of the raw score reduction, summed
over examinees, for integer-rounding would be twice that for
half-point rounding (approximately .33 times the number of students
impacted versus approximately .17 times a putative equal number of
students). Unfortunately some of the tests studied here contradict
that rating process assumption (e.g., Reading/Grade 11).

CONCLUSIONS



There was little substantive difference in score information or 
the s.e.'s of ability estimates due to type of rounding, 
integer versus half-point, above the floors of three out of the 
four tests studied. 

In the fourth, Reading/Grade 11 test there was decidedly less 
error (more precision) in the integer-rounded ability estimates 
at the lower portion of the ability continuum (from a scale
score of 300 to approximately 425). This was true for both
pattern and number correct estimates.

Integer-rounded estimates generally produce slightly larger
predicted percent of maximum (test) scores, though not
throughout the entire ability range of all four tests
studied. The expected larger positive differences or rounding
bias for number correct estimates were observed.

Within-subject differences between scale score estimates
derived using integer versus half-point scores were generally
small for both pattern and number correct ability estimates.
Several differences between approximately 29 and 36 scale score
points for pattern ability estimates and between 50 and 60
points for number correct ability estimates were observed,
however. Differences this large were approximately one half
and one, respectively, of the corresponding sample standard
deviations.

For those students scoring higher with half-point-rounded
scores, a very small number had within-subject differences in
integer-rounded versus half-point-rounded pattern or number
correct ability estimates that were as large as one standard
error in magnitude. None were as large as two s.e.'s.

Table 2

Difference in Scale Scores by Differences in Total CR Score due to Type of Rounding
Pattern Scoring

Table 3

Difference in Scale Scores by Differences in Total CR Score due to Type of Rounding
Number Correct Scoring